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Abstract 


This  paper  compares  three  distributed  algorithms  for  factoring  a  large  sparse 
symmetric  positive  definite  matrix  on  a  local-memory  parallel  processor.  The 
fan-out,  fan-in,  and  distributed  mult ifrontal  schemes  are  presented  in  a  unified 
framework  which  highlights  their  communication  requirements  and  their  in¬ 
nermost  loops.  Experimental  results  on  an  Intel  iPSC’/2  hypercube  illustrate 
'heir  relative  performance. 
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1  Introduction 


The  solution  of  large  sparse  systems  of  linear  equations  is  an  important  application 
of  parallel  computers.  In  this  paper,  we  give  a  unified  presentation  and  compare  the 
performance  of  three  distributed  schemes  to  compute  the  Cholesky  factor  of  a  large 
sparse  symmetric  positive  definite  matrix  on  a  local-memory  parallel  processor. 

The  three  distributed  schemes  are  column-based  —  we  assume  that  each  processor 
has  been  assigned  a  set  of  columns  in  the  matrix  and  is  responsible  for  computing 
the  corresponding  set  of  columns  in  the  Cholesky  factor.  The  fan-out  method  [11]  is 
a  sparse  analog  of  the  parallel  outer-product  algorithm  for  factoring  dense  matrices 
[10].  The  fan- in  method  [l]  is  a  parallel  analog  of  the  standard  algorithm  for  factoring 
sparse  matrices  [12]  which  uses  a  distributed  fan-in  scheme  (cf.  [19]).  The  distributed 
multifrontal  method  [5,  6,  7,  16.  18]  is  a  parallel  implementation  of  the  multifrontal 
algorithm  [21 ,  8]. 

The  performance  of  these  schemes  depends  on  the  ordering  of  the  variables  and 
equations  and  the  mapping  of  columns  to  processors,  but  we  do  not  consider  these 
issues  in  this  paper.  We  simply  assume  that  the  matrix  has  been  ordered  by  some  fill- 
reducing  ordering  which  is  appropriate  for  parallel  elimination,  and  that  the  columns 
of  the  permuted  matrix  have  been  assigned  to  processors  in  a  manner  which  is  con¬ 
sistent  with  efficiency. 

In  §2,  we  review  Cholesky  factorization  by  columns.  In  §3,  we  review  elimination 
trees  and  introduce  domains  and  separators,  two  partitions  of  the  matrix  problem 
based  on  the  structure  of  the  elimination  tree  and  the  mapping  function.  In  §4.  we 
describe  the  fan-out  algorithm  and  an  improved  variant,  the  domain  fan-out  algorithm 
[3.  22j.  which  offers  substantial  reductions  in  message  traffic.  In  §5,  we  describe  the 
fan-in  algorithm.  In  §6,  we  describe  the  distributed  multifrontal  algorithm  in  the 
same  terms  and  then  relate  this  to  the  more  standard  description  in  terms  of  frontal 
matrices. 

In  §7.  we  compare  the  three  distributed  schemes  on  an  Intel  iPSC/2  hvpercube. 
The  fan-in  and  multifrontal  schemes  are  shown  to  require  significantly  less  message 
traffic  than  the  domain  fan-out  scheme  on  a  model  problem.  In  terms  of  overall  par¬ 
allel  factorization  time  for  our  current  implementations,  the  distributed  multifrontal 
method  is  better  than  the  fan-in  scheme,  which  is,  in  turn,  better  than  the  domain 
fan-out  scheme. 

Throughout  this  paper,  we  assume  that  there  are  p  processors  interconnected 
by  some  underlying  network  and  communicating  among  themselves  only  through 
messages.  We  assume  that  the  jith  column  is  assigned  to  processor  map[j],  which  is 
also  responsible  for  computing  the  jth  column  of  the  Cholesky  factor.  We  use  r  to 
represent  any  of  the  p  processors,  and  myname  to  denote  the  processor  on  which  the 
code  is  running. 
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Algorithm  1  Column-Cholesky  Factorization 
for  column  j  :=  1  to  n  do 
begin 


end 


(1) 


factor,  with  entries  a,}  and  ltJ.  respectively.  As  shown  in  Algorithm  1,  the  column- 
Choleskv  method  computes  L  column  by  column.  The  temporary  vector  {t:,  ■  ■  ■ ,  )T 
is  used  only  for  clarity;  its  storage  can  overlap  completely  with  that  for  (ln,  ■  ■  ■ ,  lnj  )T ■ 
Our  formulation  is  applicable  to  both  densp  and  sparse  matrices.  In  the  sparse 
case,  both  column  j  of  A  and  the  updates  Ijki'jk,  ■  ■  ■  Jnk)T  are  sparse  so  that  the 
vector  operations  can  be  performed  in  a  manner  which  takes  advantage  of  sparsity. 
The  columns  contributing  updates  to  the  sum  are  given  precisely  by  the  nonzero 
structure  of  row  j  of  L: 


StructiLj.)  =  {k  |  k  <  j.  l3k  ^  0}  . 

There  are  many  ways  to  implement  equation  (1)  in  a  distributed  manner;  the 
algorithms  we  consider  differ  in  the  order  in  which  and  the  processor  on  which  updates 
are  computed  and  applied.  As  an  aid  in  describing  these  schemes,  we  now  introduce 
four  related  notions  [l]: 

•  factor  column  L.}  =  (ln.  •  •  • ,  /nj)T; 

•  update  column  l}k{l}k-  •  •  • , lnk)T ,  where  l3k  ^  0; 

•  complete  update  column  Ekestmctdj.)  hk(hk- •  ■  • ,  Lk)T\ 

•  aggregate  update  column  where  h  C  StructyLj.). 

Update  columns  and  complete  update  columns  are  specie1  cas°s  of  aggregate  update 
columns  where  K  =  {fc}  and  K  =  Struct{L} .),  respectively. 
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Figure  1:  A  nested  dissection  ordering  of  a  7-by-7  grid. 


3  Elimination  Trees,  Domains,  and  Separators 

3.1  Elimination  Trees 

If  k  <  j  and  ^  0  (i.e.,  k  £  Struct{L]m)),,  then  column  j  depends  on  column  k.  The 
graph  which  represents  this  precedence  relation  is  precisely  the  directed  graph  of  the 
matrix  LT .  Taking  the  transitive  reduction  yields  the  elimination  tree  [14.  15,  20]. 

More  directly,  the  elimination  tree  of  the  matrix  A  is  a  tree’  with  n  nodes 
{l.  •  •  •  ,n}  such  that  node  j  is  the  parent  of  node  k  if  and  only  if 

J  =  min{?  |  k  <  i,  l,k  ±  0}. 

We  will  use  j  to  refer  to  both  a  column  of  the  matrix  and  the  corresponding  node  in 
the  elimination  tree.  Figure  1  shows  a  nested  dissection  ordering  of  a  9-point  finite 
difference  operator  on  a  7  x  7  grid.  Figure  2  shows  the  corresponding  elimination 
tree. 

\Vithout  loss  of  generality,  we  will  assume  that  the  nodes  are  ordered  in  what 
corresponds  to  a  postorder  traversal  of  the  elimination  tree  (i.e.,  each  of  the  subtrees 
of  a  node  is  numbered  and  then  the  node  itself  is  numbered  as  in  Figure  2).  Such  an 
ordering  is  equivalent2  to  any  other  ordering  which  generates  the  same  elimination 
tree  [15]. 

3.2  Domains 

Let  T[x]  denote  the  subtree  of  the  elimination  tree  /  rooted  at  the  node  x  (tV-o  is, 
the  node  x  together  with  all  of  its  descendants  in  T ),  and  let  parent[x ]  denote  the 

*lf  A  is  reducible,  then  the  elimination  “tree”  is  actually  a  forest 

JHere  equivalence  is  in  terms  of  the  number  of  nonzeroes  in  L ,  the  amount  of  work  to  factor  A, 
and  the  operations  required  to  compute  the  factor 
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Figure  2:  The  elimination  tree  for  the  ordering  in  Figure  1. 


parent  of  node  x  in  the  tree.  If  every  node  y  in  T[x]  is  mapped  to  the  same  processor 
as  i  (i.e..  map{y)  =  map\x]  for  all  y  6  T[x]j,  and  this  property  is  not  satisfied  by 
T:j>arent{i}\.  then  the  nodes  in  T[x]  are  said  to  form  a  domain  3 

In  Figure  3  the  tree  structure  of  Figure  2  has  been  annotated  to  illustrate  features 
corresponding  to  solution  on  a  four-processor  hypercube.  The  labels  "P,”  show  a  pos¬ 
sible  mapping  of  the  nodes/columns  to  the  processors,  and  the  domains  are  indicated 
by  heavy  outlines. 

There  are  four  domains,  given  by  the  subtrees  rooted  at  nodes  9,  18,  30.  and  39. 
respectively.  Figure  4  shows  the  same  7x7  grid  as  Figure  1,  with  the  four  domains 
identified  by  heavy  outlines. 

Since  the  columns  associated  with  a  domain  depend  only  on  columns  within  the 
same  domain,  these  factor  columns  can  be  computed  by  the  processor  to  which  the 
domain  is  mapped  without  any  interprocessor  communication.  Mu  and  Rice  [17]  have 
examined  such  mappings  and  shown  that  the  total  communication  requirements  are 
lower  than  with  other  subtree-to-subcube  mappings  for  fan-out  type  methods. 

Since  we  have  assumed  that  the  nodes  in  the  elimination  tree  are  ordered  by  a 
postorder  traversal,  the  nodes  within  a  domain  are  numbered  consecutively;  i.e.,  each 
domain  corresponds  to  a  block  of  contiguous  columns  in  the  matrix. 

3An  alternate  definition  is  that  a  domain  is  a  connected  subset  R  of  nodes  in  the  graph  of  .4 
satisfying 

•  if  y  and  z  are  in  R,  then  map[y]  =  map[c]; 

•  if  y  and  u  are  adjacent  and  y  is  in  R  but  u  is  not  in  R,  then  y  is  ordered  before  u; 

•  no  superset  of  R  satisfies  these  properties 
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Figure  3:  The  elimination  tree  showing  processor  mapping,  domains  (heavy  outlines), 
and  separators  (shaded  rectangles). 
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Figure  4:  The  7x7  grid  showing  domains  (heavy  outlines)  and  separators  (shaded 
rectangles). 
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3.3  Separators 

While  domains  group  nodes  based  on  the  processor  to  which  they  are  assigned,  it 
is  also  useful  to  group  nodes  based  on  the  structure  of  their  factor  columns. 

A  maximal  subset  5  of  nodes/columns  which  are  pairwise  connected  in  the  graph 
of  L  and  which  have  identical  Cholesky  column  structures  outside  5  is  said  to  form 
a  ( supemodat)  separator  [4].  In  Figure  4  the  top-level  separators  are  identified  by 
shading.  In  terms  of  the  elimination  tree,  each  separator  is  a  chain  of  nodes  (although 
not  all  chains  are  separators).  In  Figure  3,  the  top-level  separators  are  identified  by 
shading. 

Just  as  we  defined  the  parent  of  a  node  in  the  elimination  tree,  we  can  define  the 
parent  parent [5]  of  a  separator  5  as  the  separator  S'  containing  the  parent  of  the 
highest  node  in  S. 

Since  we  have  assumed  that  the  nodes  have  been  ordered  by  a  postorder  traversal, 
a  separator  corresponds  to  a  block  of  contiguous  columns  in  the  Cholesky  factor,  where 
the  diagonal  block  is  dense  triangular  and  the  off-block-diagonal  column  structures 
are  identical. 


4  The  Fan-out  Distributed  Algorithm 

The  fan-out  algorithm  [11]  is  a  distributed  sparse  factorization  scheme  in  which  all 
communication  is  through  factor  columns  and  all  updates  to  column  j  are  computed 
and  applied  by  processor  map\j).  The  implementation  in  Algorithm  2  works  for  any 
mapping  and  any  network  topology.  For  simplicity,  we  have  allowed  processors  to  send 
factor  columns  to  themselves;  in  practice,  the  corresponding  updates  are  computed 
and  applied  as  soon  as  the  factor  column  has  been  sent  to  other  processors. 

Most  of  the  floating-point  operations  are  performed  in  the  assignment  (2).  This  is 
a  sparse  AXPY,  adding  a  multiple  (a  =  —  l,k)  of  a  sparse  vector  (x  =  lnk)T)  to 

another  sparse  vector  (y  —  L„).  It  can  be  implemented  efficiently  by  expanding  y  into 
a  dense  vector  t,  sparsely  adding  ax  (which  is  computed  with  a  scalar-dense- vector 
multiply),  and  then  compressing  the  result  back  into  y.  The  overhead  (the  expansion 
and  compression)  can  be  reduced  significantly  by  saving  the  factor  columns  so  that  all 
of  the  updates  to  a  column  are  computed  and  applied  at  the  same  time.  Unfortunately, 
this  also  increases  the  storage  requirements. 

If  one  processor  owns  ail  of  the  columns  in  some  domain  R,  then  that  processor 
can  compute  the  factor  columns  corresponding  to  R  without  any  interprocessor  com¬ 
munication.  There  are  two  ways  to  send  the  resulting  updates  to  the  other  processors: 

•  send  factor  columns,  as  above; 

•  scud  aggregate  update  columns  associated  with  the  domain  R  (each  aggregate 
update  is  sent  to  only  one  processor). 

We  will  refer  to  this  second  variant  of  the  fan-out  algorithm  as  the  domain  fan-out 
algorithm  (cf.  [3],  [22]). 
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Algorithm  2  Fan-out  Distributed  Factorization 
mycols  -  {i  |  mapft]  =  myname}  ; 
for  j  £  mycols  do 

L.j  (Ujj . anj)T  : 

for  column  j  :=  1  to  n  do 
if  j  £  mycols  then 
begin 

while  not  all  updates  to  column  j  have  been  applied  do 

begin 

Receive  a  factor  column  L.k  ; 

for  each  i  >  k  with  l,k  ^  0  and  map[i]  =  myname  do 

L.,  :=  Lmi  -  l,k(l,k . U)T  \  (2) 


end  ; 


Send  LmJ  to  each  processor  ir  for  which  there  exists  i  >  j 
such  that  ltJ  ^  0  and  map[i]  =  tt  ; 

end 


The  computation  proceeds  in  two  phases.  In  the  first  phase,  each  domain  is 
factored  as  before,  except  that  no  factor  columns  are  sent  to  other  processors.  When 
the  root  j  of  the  defining  subtree  for  a  domain  is  reached,  the  processor  computes 
an  aggregate  update  column  corresponding  to  each  column  ?  >  j  for  which  lv  ^  0. 
and  sends  it  to  processor  maph j  (again,  for  simplicity  we  have  allowed  processors  to 
send  aggregate  updates  to  themselves).  Finally  all  of  the  updates  are  applied.  In  the 
second  phase,  the  algorithm  proceeds  as  before. 

Note  that  aggregate  update  columns  are  used  to  communicate  between  domain 
and  non-domain  nodes,  and  factor  columns  are  used  to  communicate  among  non¬ 
domain  nodes.  As  we  will  see  in  §7,  the  amount  of  communication  is  significantly 
reduced. 


5  The  Fan-in  Distributed  Algorithm 

The  fan-in  algorithm  [l]  is  a  distributed  scheme  for  sparse  factorization  in  which 
all  communication  is  through  aggregate  update  columns.  In  essence,  the  complete 
update  column  for  column  j  is  written  as  a  sum  of  aggregate  update  columns  (each 
corresponding  to  a  different  processor  7r ): 

Y,  I,k{ljk,---,lnk)T  =  Y  «[;>]• 

fc€Strucl(L,.)  » 

rou>[ J,*]#? 


7 


Algorithm  3  Fan-in  Distributed  Factorization 
mycols  =  {j  I  map\j\  =  myname}  ; 
for  column  j  1  to  n  do 

if  rou'[  j.  myname}  ^  0  or  j  6  mycols  then 
begin 
t  :=  0  ; 

for  k  d  row}  j.  my  name]  do 

/  :=  t  +  ljk(l}kwJnk)T  ;  (3) 

if  j  g  mycols  then 

Send  aggregate  update  column  /  to  processor  map [j] 

else 

begin 

L.j  :=  ( a,:  .  ■  ■  -  .and7  -  t  ; 

while  not  all  aggregate  updates  have  bt*'n  received  do 

begin 

Receive  aggregate  update  column  u[j,7r]  for  column  j  ; 
f*;  ■  b  U  lJ  '  '■  ]  *  '  1  ■ 

end  . 

:=  /..;/v/r: 

end 

end 


where 

U\j-  "j  =  ]T  Ijkilj*-  -  ■  ■  -  Ink)1 

k£rou.[j.r} 

and 

rotr[j.  r]  =  {lr  €  Struci(L}.)  |  ma/>[Ar]  =  r}  . 

All  of  the  update  columns  appearing  in  u[j.  7r ]  come  from  factor  columns  that  are 
mapped  to  processor  rr,  and  thus  u[j,i r]  can  be  computed  without  any  interprocessor 
communication.  These  aggregate  updates  are  then  sent  to  the  processor  map[j]  to 
which  column  j  is  assigned,  and  that  processor  subtracts  the  aggregate  updates  from 
A.j  and  scales  to  obtain  Lmj.  Our  implementation  of  fan-in,  Algorithm  3,  works  for 
any  mapping  and  network  topology.  Since  only  aggregate  update  columns  are  sent 
between  processors,  the  incorporation  of  domains  is  implicit. 

Most  of  the  floating-point  computations  are  in  the  assignments  (3),  which  is  a 
sparse  AXPY,  and  (4),  which  is  a  sparse  vector  add.  But  most  importantly,  each  is 
part  of  a  sequence  of  sparse  vector  operations  to  the  same  destination  vector,  so 
that  the  expansion  need  be  done  only  once  for  each  ).  Thus  we  would  expect  the 
kernels  for  fan-in  to  be  more  efficient  than  the  kernel  for  fan-out.  Moreover,  the 
fan-in  algorithm  can  be  implemented  in  terms  of  supernodal  updates,  in  which  case 
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most  of  the  assignments  (3)  can  be  implemented  as  dense  AXPY's  further  improving 
performance  (see  [4]). 

Algorithm  3  proceeds  one  column  at  a  time.  If  a  processor  working  on  column 
)  has  to  wait  for  some  aggregate  update  column  from  another  processor,  then  this 
formulation  keeps  it  idle  even  though  it  could  be  doing  useful  work  on  other  columns 
This  bottleneck  can  be  avoided  by  allowing  blocked  processors  to  compute  updates 
to  later  columns  while  they  wait  This  compute-ahead  strategy  further  enhances  the 
performance  (see  [2]) 

6  The  Distributed  Multifrontal  Algorithm 

Trie  multifrontal  method  [21 . 8]  has  been  implemented  on  both  shared  memory  [5.  6.  7 
and  distributed  memory  [16.  ISl  machines.  Indeed,  the  notion  of  multiple  fronts  con¬ 
veys  the  idea  of  performing  indepe  ident  elimination  from  many  different  "frontiers" 
in  the  associated  graph.  In  this  section  we  express  this  algorithm  in  the  same  terms 
that  were  used  to  describe  the  fan-out  and  fan-in  schemes,  and  then  relate  this  to  the 
more  standard  description  in  terms  of  frontal  matrons. 

If  >  is  a  separator  ami  j  G  > .  then  we  can  write  the  complete  update  for  column 

J  <X" 

»•!/•>■;  =  E  /,*!/,*.•••.<„* i7  =  v 

A:r-  V’-u -f  i  /  i  S'  child  »>{  S 

WiffTr 

tv-  >“]  =  E 

fc€7[S"i 

and  I  [S']  denotes  the  subtree  of  the  elimination  trc  rooted  at  the  highest  node  in 
the  separator  The  first  term  is  a  sum  of  aggregate  updates;  the  second  is  a  sum 
of  updates  computed  from  factor  columns  within  the  separator.  Moreover,  we  can 
express  the  aggregate  update  r[j.  s’'1,  recursively  as 

j  =  E  *'[>• -"]  +  E  (5) 

5"  child  of  S'  k<j 

which  is  again  a  sum  of  aggregate  updates  and  updates  computed  from  factor  columns 
within  a  separator.  Finallv.  we  see  that  if  j  6  5,  then  r[j,  5]  also  satisfies  equation 

(51- 

Algorithm  4  is  our  implementation  of  the  distributed  multifrontal  method;  it  works 
for  any  mapping  and  network  topo1  gv.  Note  that  if  S]  ^  0,  then  the  piocessor 
mop[_ ;,  5]  is  assigned  to  compute  v\j,  5].  Thus  map[j,  5]  =  map[j ]  for  j  G  5. 

The  multifrontal  method  is  normally  expressed  in  terms  of  the  frontal  matrix 
associated  with  each  separator  5.  The  aggregate  updates  v[j,  5]  all  have  the  same 
nonzero  structure  as  (/,*,-•-,  l„k)T  for  any  k  G  S',  k  <  j.  Thus,  if  we  delete  the  zero 
rows,  then  we  can  gather  these  aggregate  updates  into  a  single  lower  triangular  update 


4-  E  hk-  '  '  ■  ■  Ink  ) 

k<, 

*€>' 
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Algorithm  4  Distribute Muitifrontal  Factorization 

mycoL*  —  {  j  j  map[],Sr\  =  myname  for  some  separator  S'}  ; 
for  ’  £  my  cols  do 

L.J  -=  [Qjj . Gn.i) 

for  column  j  :  =  1  to  n  do 
if  j  £  my  col*  then 
begin 

while  not  all  updates  to  some  v [j.  S']  for  which 
rnap'j.  S'}  -  myname  ha.c  been  applied  do 

begin 

Receive  aggregate  update  r[t.  S'"  or  factor  column  L.k  : 
if  received  r[i.S"]  then 
begin 

Let  V'  =  pnrcn^S"]  : 

tins'}  :=  rit.5'1  +  f[t,  S"]  ;  (6) 

if  all  updates  to  r[t.  S)  have  been  applied  then 
Send  r-j.  S']  to  map[i .  pa rc r<  f  [S'j]  ; 

end 

else 

for  t  £  S'  with  Li c  ^  0  and  map[t.  S']  =  myvamt  do 

begin 

r[i.S'j  v[i.  S'] +  /tfc(/,k,  )r  ;  (7) 

if  all  updates  to  r[t,S']  have  been  applied  then 
Send  r[t.  S']  to  map|t . pnrcntJS']]  ; 

end 


end 

Let  S  be  the  separator  to  whirl)  j  belongs  ; 
if  map[j.S ]  =  mynamc  then 
begin 


L.j  —  L.j  r[j,  S]  ; 

Send  L.j  to  every  processor  tr  for  which  there  exists  t  >  j 
such  that  ^  0  and  mapli.S]  =  tt  ; 


end 

end 
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From 

To 

Pure 

Fan-out 

Domain 

Fan-out 

Fam-in 

Domain 

Multifrontal 

Domain 

Parent 

faictor 

update 

update 

update 

Domain 

Ancestor 

factor 

update 

update 

Separator 

Parent 

factor 

update 

update 

Separator 

Ancestor 

factor 

update 

- - 

Separator 

Self 

update 

factor 

Table  1:  Message  types  based  on  domain-separator  model 


matrix.  This  update  matrix  corresponds  to  the  frontal  matrix  for  5  and  represents 
the  sum  of  all  update  contributions  from  columns  in  the  separator  together  with 
descendant  columns  of  the  separator  in  the  elimination  tree. 

Most  of  the  floating-point  arithmetic  is  in  the  assignment  (7).  which  can  be  im¬ 
plemented  a s  a  dense  AXPY  since  t’[i,  S']  and  /,*(/,*.  •  •  • ,  lnk)T  have  the  same  nonzero 
structure.  The  assignment  (6).  which  is  a  sparse  AXPY,  represents  a  lower  order  term. 
Thus  we  would  expect  the  kernels  for  multifrontal  to  be  more  efficient  than  the  kernels 
for  either  fan-out  or  fan-in  (without  supernodal  updates). 

It  is  straight-forward  to  incorporate  domains  into  the  multifrontal  algorithm.  Let 
5  denote  the  root  separator  for  a  domain.  Then,  after  the  domain  has  been  fac¬ 
tored.  the  aggregate  updates  r[j.  51.  j  £  5.  can  be  computed  and  sent  to  processors 
map'j.  par(nt{S}].  We  shall  refer  to  this  variant  as  the  domain  multifrontal  algorithm. 

In  order  to  minimize  communication,  the  processor  map\j.  5]  assigned  to  compute 
v'j.S]  is  always  chosen  from  the  set  Q  —  E  S}  in  such  a  manner  as  to 

balance  the  load  among  the  processors  in  Q. 


7  Comparisons  of  Distributed  Algorithms 

We  have  described  several  distributed  factorization  schemes  in  the  last  three  sections. 
In  the  pure  fan-out  scheme,  processors  communicate  using  only  factor  columns.  In 
the  domain  fan-out  scheme,  aggregate  update  columns  are  used  to  pass  information 
from  domain  to  non-domain  nodes,  while  factor  columns  are  used  to  pass  information 
among  non-domain  nodes.  In  the  fan-in  scheme,  processors  communicate  using  only 
a8firegate  update  columns.  In  the  domain  multifrontal  method,  the  processors  use  a 
mixture  of  factor  columns  and  aggregate  update  columns. 

In  Table  1,  we  summarize  the  communication  aspects  of  these  schemes.  Here 
"Parent’'  refers  to  the  separator  immediately  above  the  domain  or  separator  in  the 
elimination  tree,  and  “Ancestor"  refers  to  a  non-parent  separator  lying  on  the  path 
from  the  domain  or  separator  to  the  root  of  the  elimination  tree.  Note  that  in  the 
multifrontal  method,  column  information  is  never  forwarded  from  a  domain  or  sep 
arator  directly  to  am  ancestor  separator;  it  must  be  passed  to  the  parent  separator, 
then  to  the  grand  parent  separator,  etc.,  until  it  reaches  the  destination. 
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No.of 

Distributed 

Total  Messages 

01 

Recv  Mesgs 

Proc. 

Algorithm 

min 

avg 

max 

max 

min 

Domain  Fan-out 

711 

650 

681 

3 

MM 

16 

Fan-in 

238 

220 

234 

I 

is 

Multifrontal 

329 

298 

318 

| 

WM 

172 

137 

Domain  Fan-out 

997 

825 

906 

513 

367 

484 

379 

32 

Fan-in 

295 

253 

272 

138 

133 

159 

117 

Multifrontal 

348 

307 

332 

186 

145 

184 

145 

Domain  Fan-out 

916 

614 

808 

1 

487 

248 

64 

Fan-in 

348 

246 

298 

■ 

lpl|l 

200 

96 

Multifrontal 

353 

252 

309 

193 

186 

118 

Table  2:  Average  message  counts  per  processor  for  the  63-by-63  grid  problem 


We  now  present  experimental  results  for  a  model  problem,  the  sparse  symmet¬ 
ric  positive  definite  system  associated  with  a  9- point  difference  operator  on  a  k-by-k 
regular  grid.  The  multiprocessor  is  an  Intel  iPSC/2  hypcrcube  with  Weitek  1167 
floating-point  chips  The  three  distributed  schemes  were  implemented  in  C.  Results 
are  not  presented  for  the  pure  fan-out  method  because  it  was  uniformly  and  substan¬ 
tially  worse  than  domain  fan-out. 

The  nested  dissection  ordering  was  used  to  order  the  variables,  since  it  gives 
optimal-order  fill  and  a  well-balanced  elimination  tree  [9],  A  subtree-to-subcubc 
mapping  [13]  was  used  to  assign  columns  to  processors.  We  chose  equal-size  domains 
(one  per  processor)  of  largest  possible  size,  since  that  choice  is  known  to  give  good  load 
balance  and  reduced  communication  [17].  Thus  the  columns/nodes  in  the  last  grid 
dissector  are  assigned  to  the  2d  processors  of  the  entire  hypercube,  and  the  nodes 
associated  with  each  of  the  two  remaining  subgrids  are  mapped  to  one  of  the  two 
( d  -  1  [-dimensional  subcubes  in  a  recursive  manner. 

For  each  of  the  three  distributed  factorization  schemes,  the  amount  of  message 
traffic  and  the  amount  of  parallel  numerical  computation  depends  on  the  mapping 
function.  Table  2  shows  the  actual  message  counts  for  a  63  x  63  grid.  Table  3  shows 
the  corresponding  message  volumes.  Figure  5  shows  the  average  message  volume 
for  a  63-by-63  grid  problem,  Figure  6  the  average  message  counts,  and  Figure  7  the 
efficiency  of  the  domain  fan-out,  fan-in,  and  domain  multifrontal  methods.  Table  4 
gives  speedups  for  these  schemes,  relative  to  a  state-of-the-art  serial  code,  also  written 
in  C. 

It  is  clear  from  these  experimental  results  that  domain  fan-out  is  inferior  to  the  fan- 
in  and  the  multifrontal  schemes  for  our  current  implementations.  The  fan-in  scheme 
requires  marginally  less  message  traffic  than  the  distributed  multifrontal  method,  but 
it  is  not  as  efficient. 
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Average 

Message 

Count 


Number  of  Processors 

Figure  6:  Average  message  counts  per  processor  for  the  63-by-63  grid  problem 


Figure  7:  Parallel  efficiency  for  the  »>3-by-63  grid  problem 
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